31.10.2019
“This is a Massive Open Online Course (MOOC) meaning that everything you need to complete the course in terms of materials and exercises will be freely available online.”
When I first started to think about learning online, I realized that this is a good opportunity for me because, like all of us, I have very limited time for learning new skills. The MOOC concept is equal for everybody and benefits those of us who, for whatever reason, cannot come to a traditional classroom setting. There is so much I wish to learn about using R and data analytics. Fortunately, I have quite good basic knowledge of biostatistics, but only very basic skills in using R.
After the first exercise, I found that online learning seems to take at least as much time as traditional classroom learning, but you can decide when you put your effort into learning. I heard about this course from the coordinator of the UEF's Doctoral Programme in Clinical Research and prof. Reijo Sund.
You can find my GitHub repository from here
Br,
Juuso
The theme for week 2 was regression analysis. The week 2 exercises consist of 1) data wrangling exercises and 2) data analysis exercises. You can find the results of my second week below.
# read the data into memory
std14 <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt", sep=",", header=TRUE)
The dataset consists of 7 variables (gender (factor), age (int), attitude (num), deep (num), stra (num), surf (num), and points (int)) and 166 observations. I excluded from the data those observations where the exam points were 0. You can find the variable names, short descriptions, and some basic characteristics of the data below:
#Explore structure and dimensions of the dataset
str(std14)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
dim(std14)
## [1] 166 7
summary(std14)
## gender age attitude deep stra
## F:110 Min. :17.00 Min. :1.400 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :3.200 Median :3.667 Median :3.188
## Mean :25.51 Mean :3.143 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :5.000 Max. :4.917 Max. :5.000
## surf points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
According to the graphical overview, age is skewed and gender is unevenly distributed (110 females, 56 males), while all the other variables are fairly normally distributed.
# Access the tidyverse libraries tidyr, dplyr, ggplot2
library(tidyr); library(dplyr); library(ggplot2); library(corrplot)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## corrplot 0.84 loaded
glimpse(std14)
## Observations: 166
## Variables: 7
## $ gender <fct> F, M, F, M, M, F, M, F, M, F, M, F, F, F, M, F, F, F, M, F...
## $ age <int> 53, 55, 49, 53, 49, 38, 50, 37, 37, 42, 37, 34, 34, 34, 35...
## $ attitude <dbl> 3.7, 3.1, 2.5, 3.5, 3.7, 3.8, 3.5, 2.9, 3.8, 2.1, 3.9, 3.8...
## $ deep <dbl> 3.583333, 2.916667, 3.500000, 3.500000, 3.666667, 4.750000...
## $ stra <dbl> 3.375, 2.750, 3.625, 3.125, 3.625, 3.625, 2.250, 4.000, 4....
## $ surf <dbl> 2.583333, 3.166667, 2.250000, 2.250000, 2.833333, 2.416667...
## $ points <int> 25, 12, 24, 10, 22, 21, 21, 31, 24, 26, 31, 31, 23, 25, 21...
gather(std14) %>% glimpse
## Warning: attributes are not identical across measure variables;
## they will be dropped
## Observations: 1,162
## Variables: 2
## $ key <chr> "gender", "gender", "gender", "gender", "gender", "gender", "...
## $ value <chr> "F", "M", "F", "M", "M", "F", "M", "F", "M", "F", "M", "F", "...
# draw a bar plot of each variable and add frequency count labels above the bars
gather(std14) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()+ geom_text(stat='count', aes(label=..count..), vjust=-1)
## Warning: attributes are not identical across measure variables;
## they will be dropped
My aim was to find out the relationship between the exam points and attitude, age, and gender. In practice, that means how attitude, age, and gender are associated with the achieved exam points in this population. First of all, I made a correlation matrix (see below). Correlation analysis tells us about the association, or the absence of one, between two variables ‘x’ and ‘y’.
A correlation matrix is a table showing correlation coefficients between variables. Each cell in the table shows the correlation between two variables. A positive correlation means a direct association between the two variables, and a negative correlation an inverse association. If we focus on my main aim, we find a positive correlation between points and gender (r = 0.093) and between points and attitude (r = 0.437), and a negative correlation between points and age (r = -0.093).
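As a minimal sketch of what a single cell of a correlation matrix contains, the Pearson coefficient can be computed by hand and compared with `cor()`. The example uses R's built-in `mtcars` data instead of the course data, and the helper name `pearson_r` is made up for illustration.

```r
# Pearson correlation by hand: co-variation scaled by the spread of both variables.
# 'pearson_r' is a hypothetical helper, not part of the course material.
pearson_r <- function(x, y) {
  sum((x - mean(x)) * (y - mean(y))) /
    sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
}

r_manual  <- pearson_r(mtcars$mpg, mtcars$wt)
r_builtin <- cor(mtcars$mpg, mtcars$wt)

# both values agree; the sign is negative because heavier cars have lower mpg
print(c(manual = r_manual, builtin = r_builtin))
```

The same function applied to two columns of `std14` should reproduce the corresponding cell of the matrix printed below.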
# convert gender as integer
std14$gender <- as.integer(std14$gender)
# calculate the correlation matrix and round it
cor.matrix <- cor(std14)
head(round(cor.matrix,2))
## gender age attitude deep stra surf points
## gender 1.00 0.12 0.29 0.06 -0.15 -0.11 0.09
## age 0.12 1.00 0.02 0.03 0.10 -0.14 -0.09
## attitude 0.29 0.02 1.00 0.11 0.06 -0.18 0.44
## deep 0.06 0.03 0.11 1.00 0.10 -0.32 -0.01
## stra -0.15 0.10 0.06 0.10 1.00 -0.16 0.15
## surf -0.11 -0.14 -0.18 -0.32 -0.16 1.00 -0.14
cor.matrix
## gender age attitude deep stra surf
## gender 1.00000000 0.11901733 0.29423035 0.05809597 -0.14552789 -0.1126999
## age 0.11901733 1.00000000 0.02220071 0.02507804 0.10244409 -0.1414052
## attitude 0.29423035 0.02220071 1.00000000 0.11024302 0.06174177 -0.1755422
## deep 0.05809597 0.02507804 0.11024302 1.00000000 0.09650255 -0.3238020
## stra -0.14552789 0.10244409 0.06174177 0.09650255 1.00000000 -0.1609729
## surf -0.11269987 -0.14140516 -0.17554218 -0.32380198 -0.16097287 1.0000000
## points 0.09290782 -0.09319032 0.43652453 -0.01014849 0.14612247 -0.1443564
## points
## gender 0.09290782
## age -0.09319032
## attitude 0.43652453
## deep -0.01014849
## stra 0.14612247
## surf -0.14435642
## points 1.00000000
# visualize the correlation matrix
corrplot(cor.matrix, method = "number")
After the correlation analysis I made a regression analysis. Regression analysis predicts the value of the dependent variable based on the known values of the independent variables, assuming an average mathematical relationship between two or more variables.
# create a regression model with multiple explanatory variables
my_model1 <- lm(points ~ attitude + age + gender, data = std14)
# print out a summary of the model
summary(my_model1)
##
## Call:
## lm(formula = points ~ attitude + age + gender, data = std14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.4590 -3.3221 0.2186 4.0247 10.4632
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.75963 2.31478 5.944 1.65e-08 ***
## attitude 3.60657 0.59322 6.080 8.34e-09 ***
## age -0.07586 0.05367 -1.414 0.159
## gender -0.33054 0.91934 -0.360 0.720
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.315 on 162 degrees of freedom
## Multiple R-squared: 0.2018, Adjusted R-squared: 0.187
## F-statistic: 13.65 on 3 and 162 DF, p-value: 5.536e-08
# draw diagnostic plots using the plot() function. Choose the plots Residuals vs Fitted values = 1, Normal QQ-plot = 2 and Residuals vs Leverage = 5
par(mfrow = c(2,2))
plot(my_model1, which = c(1,2,5))
Let’s explain the analysis output step by step.
As you can see, the first item shown in the output is the formula R used to fit the data. Note the simplicity in the syntax: the formula just needs the predictors (attitude, age, gender) and the target/response variable (points), together with the data being used (std14).
The next item in the model output talks about the residuals. Residuals are essentially the difference between the actual observed response values and the response values that the model predicted. The Residuals section of the model output breaks it down into 5 summary points. When assessing how well the model fits the data, you should look for a distribution that is symmetrical across these points around zero (0).
The next section in the model output talks about the coefficients of the model.
The Estimate column contains four rows; the first one is the intercept. The intercept is the point where the fitted function crosses the y-axis, i.e. the predicted points when all predictors are zero. The other rows are the slopes of the predictors. The slope of attitude says that for every one-unit increase in attitude the exam points go up by about 3.6, when age and gender are held constant.
The coefficient Standard Error measures how much the coefficient estimate would vary, on average, if the analysis were repeated on new samples; the smaller it is relative to the estimate, the more precise the estimate.
The coefficient t-value is a measure of how many standard errors our coefficient estimate is away from 0. We want it to be far away from zero, as this would indicate that we could reject the null hypothesis, that is, we could declare a relationship between attitude and exam points.
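The link between the Estimate, Std. Error and t value columns can be checked with a few lines of arithmetic. This is only a sketch: the numbers are copied by hand from the `summary(my_model1)` output above, so small differences in the last digit come from rounding of the printed values.

```r
# Estimates and standard errors copied from the summary(my_model1) output
estimate  <- c(attitude = 3.60657, age = -0.07586, gender = -0.33054)
std_error <- c(attitude = 0.59322, age = 0.05367, gender = 0.91934)

# t value = estimate / standard error
t_value <- estimate / std_error
print(round(t_value, 3))
```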
Pr(>|t|) in the model output is the probability of observing a t-value at least as extreme (in absolute value) as the one obtained, if the null hypothesis were true. A small p-value indicates that it is unlikely that we would observe such a relationship between a predictor and the response (exam points) due to chance alone. Typically, a p-value of 5% or less is used as a cut-off point. In our model, the p-values of the intercept and attitude are very close to zero, while those of age (0.159) and gender (0.720) are well above 0.05. Note the ‘Signif. codes’ associated with each estimate. Three stars (or asterisks) represent a highly significant p-value. Consequently, the small p-value of the attitude slope indicates that we can reject the null hypothesis, which allows us to conclude that there is a relationship between attitude and exam points; for age and gender we cannot.
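A small sketch of how the p-values follow from the t-values: under the null hypothesis the t-value follows a t distribution with the residual degrees of freedom (162 here), and the two-sided p-value is the probability mass in both tails. The t-values are copied by hand from the output above.

```r
# t values and residual degrees of freedom copied from summary(my_model1)
t_value  <- c(attitude = 6.080, age = -1.414, gender = -0.360)
df_resid <- 162

# two-sided p-value: probability of a t at least this extreme in either tail
p_value <- 2 * pt(-abs(t_value), df = df_resid)
print(signif(p_value, 3))
```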
Residual Standard Error is a measure of the quality of a linear regression fit. Theoretically, every linear model is assumed to contain an error term E. Due to the presence of this error term, we are not capable of perfectly predicting our response variable (exam points) from the predictors (attitude, age and gender). The Residual Standard Error is the average amount that the response (exam points) will deviate from the true regression line. In our example, the actual exam points deviate from the regression line by approximately 5.315 points, on average.
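As a sketch of where this number comes from, the residual standard error is the square root of the residual sum of squares divided by the residual degrees of freedom. The example below uses the built-in `cars` data as a stand-in for the course model, so the value it prints is not the 5.315 reported above.

```r
# Stand-in model on built-in data (not the course model)
fit <- lm(dist ~ speed, data = cars)

# residual sum of squares and residual degrees of freedom (n minus number of coefficients)
rss <- sum(residuals(fit)^2)
rse_manual  <- sqrt(rss / df.residual(fit))

# sigma() extracts the same quantity from a fitted lm
rse_builtin <- sigma(fit)
print(c(manual = rse_manual, builtin = rse_builtin))
```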
The R-squared (R2) statistic provides a measure of how well the model fits the actual data. It takes the form of a proportion of variance. R2 is a measure of the linear relationship between our predictor variables (attitude, age and gender) and our response / target variable (exam points). It always lies between 0 and 1 (i.e. a number near 0 represents a regression that does not explain the variance in the response variable well, and a number close to 1 does explain the observed variance in the response variable). In our example, the R2 we get is 0.2018, so roughly 20% of the variance found in the response variable (exam points) can be explained by the predictor variables (attitude, age and gender).
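The output also reports an adjusted R-squared of 0.187, which penalizes R2 for the number of predictors. A short sketch of that arithmetic, using only numbers from the output above (n = 166 observations, p = 3 predictors):

```r
# values copied from the summary(my_model1) output
r2 <- 0.2018   # multiple R-squared
n  <- 166      # observations
p  <- 3        # predictors (attitude, age, gender)

# adjusted R-squared shrinks R2 according to the degrees of freedom used
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 3))
```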
The F-statistic is a good indicator of whether there is a relationship between our predictors and the response variable. The further the F-statistic is from 1, the better. However, how much larger the F-statistic needs to be depends on both the number of data points and the number of predictors. Generally, when the number of data points is large, an F-statistic that is only a little bit larger than 1 is already sufficient to reject the null hypothesis (H0: there is no relationship between attitude + age + gender and exam points). Conversely, if the number of data points is small, a large F-statistic is required to ascertain that there may be a relationship between the predictor and response variables. In our example the F-statistic is 13.65, which is clearly larger than 1 given the size of our data.
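For a model with an intercept, the overall F-statistic can be recovered from R2 and the degrees of freedom. The sketch below redoes that calculation with the numbers printed in the output above; the result matches the reported 13.65 on 3 and 162 DF up to rounding.

```r
# values copied from the summary(my_model1) output
r2 <- 0.2018
n  <- 166
p  <- 3

# explained variance per predictor divided by unexplained variance per residual df
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))

# upper-tail probability of the F distribution gives the model p-value
p_val <- pf(f_stat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)
print(c(F = f_stat, p = p_val))
```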
Last, I checked graphically the validity of the model assumptions. For that I produced the following diagnostic plots: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage. Let’s begin with the Residuals vs Fitted plot. It looks like one coming from a linear model fitted to data that satisfies the standard assumptions of linear regression. The scatterplot shows a good setup for a linear regression: the data appear to be well modeled by a linear relationship between y and x, and the points appear to be randomly spread out about the line, with no discernible non-linear trends or changes in variability.
The Normal QQ plot helps us to assess whether the residuals are roughly normally distributed. In this case the residuals match the diagonal line pretty well, which means that they are roughly normally distributed (normality of the residuals is another model assumption).
Outliers and the Residuals vs Leverage plot. There’s no single accepted definition of what constitutes an outlier. This plot has the typical look of a case with no influential observations: we cannot see the Cook’s distance lines (red dashed lines) because all cases are well inside them.
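The visual check can be complemented numerically. The following is a sketch under assumptions: it uses the built-in `cars` data as a stand-in model, and it treats 0.5 as the cut-off because `plot()` for `lm` objects draws its Cook's distance contours at 0.5 and 1 by default; other rules of thumb exist.

```r
# Stand-in model on built-in data (not the course model)
fit <- lm(dist ~ speed, data = cars)

# Cook's distance for every observation
cook <- cooks.distance(fit)

# observations beyond the first contour line would count as influential here
influential <- which(cook > 0.5)

print(head(sort(cook, decreasing = TRUE), 3))  # the three largest distances
print(influential)
```

Replacing `fit` with `my_model1` would give the same check for the model discussed in the text.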
These data concern student achievement in secondary education at two Portuguese schools. The data attributes include student grades and demographic, social and school-related features, and the data were collected by using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see paper source for more details).
Source:
Paulo Cortez, University of Minho, Guimarães, Portugal, http://www3.dsi.uminho.pt/pcortez
Relevant Papers:
P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7.
Let’s start working!
# read the data into memory
alc <- read.csv("C:/Users/juusov/Documents/IODS-project/Data/alc.csv", header = TRUE, sep = ",")
# print out the names of the variables in the data
names(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
My aim is to find out how age, free time after school, current health status, and number of school absences are associated with high/low alcohol consumption among students. My hypothesis is that heavy drinkers (who are more frequently men than women) have more school absences and free time, are older, and have poorer perceived health. Let’s pick the variables we’re interested in and look at some basic statistics.
# access the tidyverse libraries dplyr, ggplot2, corrplot, and boot
library(tidyr); library(dplyr); library(ggplot2); library(corrplot); library(boot)
# produce mean statistics by group
alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_age = mean(age), mean_free_time = mean(freetime), mean_health = mean(health), mean_absence = mean(absences))
## # A tibble: 4 x 7
## # Groups: sex [2]
## sex high_use count mean_age mean_free_time mean_health mean_absence
## <fct> <lgl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 F FALSE 156 16.6 2.93 3.38 4.22
## 2 F TRUE 42 16.5 3.36 3.40 6.79
## 3 M FALSE 112 16.3 3.39 3.71 2.98
## 4 M TRUE 72 17.0 3.5 3.88 6.12
The results are grouped by sex and high/low alcohol consumption. We can see that among females there are 156 low/moderate drinkers and 42 heavy drinkers. Correspondingly, among males there are 112 low/moderate drinkers and 72 heavy drinkers. Fortunately, in both sexes there are more low/moderate drinkers than heavy drinkers. See the other details above.
# boxplots of the whole population
par(mfrow=c(1,5))
boxplot(alc$age, main="Age")
boxplot(alc$freetime, main="Freetime")
boxplot(alc$health, main=" Current Health Status")
boxplot(alc$absences, main="Number of School Absences")
boxplot(alc$alc_use, main="Alcohol using")
# boxplots by sex
par(mfrow=c(1,5))
boxplot(alc$age~alc$sex, main="Age")
boxplot(alc$freetime~alc$sex, main="Freetime")
boxplot(alc$health~alc$sex, main=" Current Health Status")
boxplot(alc$absences~alc$sex, main="Number of School Absences")
boxplot(alc$alc_use~alc$sex, main="Alcohol using")
# boxplots by alcohol high use
par(mfrow=c(1,4))
boxplot(alc$age~alc$high_use, main="Age")
boxplot(alc$freetime~alc$high_use, main="Freetime")
boxplot(alc$health~alc$high_use, main=" Current Health Status")
boxplot(alc$absences~alc$high_use, main="Number of School Absences")
# choose columns to keep for the analyses
keep_columns <- c("age", "sex", "freetime", "health", "absences", "alc_use", "high_use")
# select the 'alc_subset' to create a new dataset
alc_subset <- dplyr::select(alc, one_of(keep_columns))
# draw a bar plot of each variable
gather(alc_subset) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
As we can see from the distribution plots and bars, only freetime is roughly normally distributed, and sex is fairly evenly split. My hypothesis is partially supported. Males seem to use more alcohol than females. Heavy drinkers are older than moderate drinkers and they have more school absences, but there are no clear differences in free time or current health status between the drinking groups.
# model with glm
m <- glm(alc_subset$high_use ~ alc_subset$age + alc_subset$sex + alc_subset$freetime + alc_subset$health + alc_subset$absences, data = alc, family = "binomial")
#print out summary
summary(m)
##
## Call:
## glm(formula = alc_subset$high_use ~ alc_subset$age + alc_subset$sex +
## alc_subset$freetime + alc_subset$health + alc_subset$absences,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1098 -0.8203 -0.6121 1.0681 2.0876
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.94027 1.81093 -3.280 0.001037 **
## alc_subset$age 0.18163 0.10220 1.777 0.075542 .
## alc_subset$sexM 0.86250 0.24770 3.482 0.000498 ***
## alc_subset$freetime 0.28776 0.12533 2.296 0.021677 *
## alc_subset$health 0.05873 0.08800 0.667 0.504507
## alc_subset$absences 0.09335 0.02301 4.058 4.95e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 420.99 on 376 degrees of freedom
## AIC: 432.99
##
## Number of Fisher Scoring iterations: 4
# compute odds ratios (OR)
OR <- coef(m) %>% exp
# compute confidence intervals (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.002631321 7.021357e-05 0.08669783
## alc_subset$age 1.199171825 9.830119e-01 1.46888737
## alc_subset$sexM 2.369086597 1.464751e+00 3.87560256
## alc_subset$freetime 1.333435235 1.045768e+00 1.71125765
## alc_subset$health 1.060491092 8.937200e-01 1.26298969
## alc_subset$absences 1.097850460 1.051579e+00 1.15103254
“When a logistic regression is calculated, the regression coefficient (b1) is the estimated increase in the log odds of the outcome per unit increase in the value of the exposure. In other words, the exponential function of the regression coefficient (eb1) is the odds ratio associated with a one-unit increase in the exposure. An odds ratio (OR) is a measure of association between an exposure and an outcome. The OR represents the odds that an outcome will occur given a particular exposure, compared to the odds of the outcome occurring in the absence of that exposure.” (Szumilas M. Explaining odds ratios [published correction appears in J Can Acad Child Adolesc Psychiatry. 2015 Winter;24(1):58]. J Can Acad Child Adolesc Psychiatry. 2010;19(3):227–229.)
Let’s look at the coefficients first. In this case sex, freetime, and school absences are significantly associated with high alcohol use. If we look at the odds ratios (OR), we can conclude that being male multiplies the odds of high alcohol use by 2.37 (137% higher odds), each one-unit increase in freetime by 1.33 (33%), and each additional school absence by 1.10 (10%). This analysis gets us closer to the final conclusion. The hypothesis is still partly alive: now we can say that sex, freetime and school absences are statistically associated with higher alcohol consumption in this population.
Next we can compare the predicted values with the real values and estimate how good our model is at prediction. The model classifies about 73% of the students correctly (training error 0.267), so we can say that the model accuracy is acceptable.
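The step from coefficients to odds ratios is just exponentiation, which the sketch below repeats with numbers copied by hand from the `summary(m)` output. The interval it computes is a Wald interval (a normal approximation on the log-odds scale), which is an assumption made for illustration; `confint()` above uses profile likelihood, so its limits differ slightly.

```r
# Coefficients and standard errors copied from the summary(m) output
beta <- c(sexM = 0.86250, freetime = 0.28776, absences = 0.09335)
se   <- c(sexM = 0.24770, freetime = 0.12533, absences = 0.02301)

# odds ratio = exp(log-odds coefficient)
or <- exp(beta)

# approximate 95% Wald interval, exponentiated to the odds-ratio scale
wald_ci <- exp(cbind(lower = beta - qnorm(0.975) * se,
                     upper = beta + qnorm(0.975) * se))

print(round(cbind(OR = or, wald_ci), 3))
```

An interval that excludes 1 corresponds to a coefficient that differs significantly from zero at the 5% level.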
#fit the model
m2 <- glm(high_use ~ sex + freetime + absences, data = alc_subset, family = "binomial")
# predict() the probability of high_use
probabilities <- predict(m2, type = "response")
# add the predicted probabilities to 'alc_subset'
alc_subset <- mutate(alc_subset, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc_subset <- mutate(alc_subset, prediction = probability > 0.5)
# see the last twenty original classes, predicted probabilities, and class predictions
select(alc_subset, sex, freetime, absences, high_use, probability, prediction) %>% tail(20)
## sex freetime absences high_use probability prediction
## 363 F 4 8 FALSE 0.30998649 FALSE
## 364 F 5 9 FALSE 0.39835071 FALSE
## 365 F 4 0 FALSE 0.17042678 FALSE
## 366 F 3 3 FALSE 0.17090391 FALSE
## 367 F 4 2 TRUE 0.19988715 FALSE
## 368 F 1 0 FALSE 0.07923997 FALSE
## 369 F 5 14 TRUE 0.51915876 TRUE
## 370 M 2 4 TRUE 0.28999668 FALSE
## 371 M 4 2 FALSE 0.37497539 FALSE
## 372 M 4 3 FALSE 0.39816238 FALSE
## 373 M 3 0 FALSE 0.26961553 FALSE
## 374 M 4 7 TRUE 0.49452118 FALSE
## 375 F 3 1 FALSE 0.14494141 FALSE
## 376 F 4 6 FALSE 0.26977031 FALSE
## 377 F 4 2 FALSE 0.19988715 FALSE
## 378 F 3 2 FALSE 0.15748815 FALSE
## 379 F 2 2 FALSE 0.12270339 FALSE
## 380 F 1 3 FALSE 0.10346444 FALSE
## 381 M 4 4 TRUE 0.42181554 FALSE
## 382 M 4 2 TRUE 0.37497539 FALSE
# initialize a plot of 'high_use' versus 'probability' in 'alc_subset'
g <- ggplot(alc_subset, aes(x = probability, y = high_use, col = prediction))
# define the geom as points and draw the plot
g + geom_point()
# tabulate the target variable versus the predictions
table(high_use = alc_subset$high_use, prediction = alc_subset$prediction)%>%prop.table()%>%addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.66492147 0.03664921 0.70157068
## TRUE 0.23036649 0.06806283 0.29842932
## Sum 0.89528796 0.10471204 1.00000000
# define a loss function (average prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the data
loss_func(class = alc_subset$high_use, prob = alc_subset$probability)
## [1] 0.2670157
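The training error above can also be read directly from the proportion table: the diagonal cells are the correct predictions and the off-diagonal cells are the errors. A minimal sketch, with the four proportions copied by hand from the table output:

```r
# proportions copied from the prop.table() output (rows: high_use, columns: prediction)
conf_mat <- matrix(c(0.66492147, 0.23036649,   # prediction = FALSE column
                     0.03664921, 0.06806283),  # prediction = TRUE column
                   nrow = 2,
                   dimnames = list(high_use = c("FALSE", "TRUE"),
                                   prediction = c("FALSE", "TRUE")))

accuracy   <- sum(diag(conf_mat))  # correctly classified share
error_rate <- 1 - accuracy         # same quantity as loss_func() returns

print(c(accuracy = accuracy, error_rate = error_rate))
```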
# K-fold cross-validation
cv <- cv.glm(data = alc_subset, cost = loss_func, glmfit = m, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.3442614
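To make the idea behind `cv.glm()` more concrete, here is a hand-written sketch of 10-fold cross-validation. It runs on simulated data (the data frame `sim` and its coefficients are assumptions made purely for illustration), and it mirrors the loss function defined above; it is not the internal implementation of `cv.glm()`.

```r
set.seed(1)

# simulated binary outcome that depends weakly on one predictor (illustration only)
n   <- 200
sim <- data.frame(x = rnorm(n))
sim$y <- rbinom(n, size = 1, prob = plogis(0.8 * sim$x))

# same loss as in the text: share of predictions on the wrong side of 0.5
loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)

# assign every row to one of K folds at random
K    <- 10
fold <- sample(rep(1:K, length.out = n))

fold_error <- numeric(K)
for (k in 1:K) {
  train_k <- sim[fold != k, ]   # fit on K - 1 folds
  test_k  <- sim[fold == k, ]   # evaluate on the held-out fold
  fit_k   <- glm(y ~ x, data = train_k, family = "binomial")
  prob_k  <- predict(fit_k, newdata = test_k, type = "response")
  fold_error[k] <- loss_func(test_k$y, prob_k)
}

# folds have equal size here, so a plain mean is the cross-validated error
cv_error <- mean(fold_error)
print(cv_error)
```

Because every observation is predicted by a model that never saw it, the cross-validated error is usually somewhat higher than the training error.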
# access the packages
library(MASS); library(corrplot); library(tidyr); library(dplyr); library(ggplot2)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
# load the data
data("Boston")
# explore the dataset
dim(Boston)
## [1] 506 14
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
The “Boston {MASS}” dataset consists of housing values in the suburbs of Boston. The Boston data frame has 506 rows and 14 columns.
This data frame contains the following variables:
crim: per capita crime rate by town.
zn: proportion of residential land zoned for lots over 25,000 sq.ft.
indus: proportion of non-retail business acres per town.
chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox: nitrogen oxides concentration (parts per 10 million).
rm: average number of rooms per dwelling.
age: proportion of owner-occupied units built prior to 1940.
dis: weighted mean of distances to five Boston employment centres.
rad: index of accessibility to radial highways.
tax: full-value property-tax rate per $10,000.
ptratio: pupil-teacher ratio by town.
black: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
lstat: lower status of the population (percent).
medv: median value of owner-occupied homes in $1000s.
# Change the shape of the data from wide-format to long-format
require(reshape2)
## Loading required package: reshape2
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
melt.boston <- melt(Boston)
## No id variables; using all as measure variables
head(melt.boston)
## variable value
## 1 crim 0.00632
## 2 crim 0.02731
## 3 crim 0.02729
## 4 crim 0.03237
## 5 crim 0.06905
## 6 crim 0.02985
# draw a density plot of each variable
ggplot(data = melt.boston, aes(x = value)) + stat_density() + facet_wrap(~variable, scales = "free")
# plot matrix of the Boston dataset variables
pairs(Boston)
# calculate the correlation matrix of the Boston dataset and round it
cor_matrix<-cor(Boston)
# print the correlation matrix
cor_matrix %>% round(digits = 2)
## crim zn indus chas nox rm age dis rad tax ptratio
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58 0.29
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31 -0.39
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72 0.38
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04 -0.12
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67 0.19
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29 -0.36
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51 0.26
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53 -0.23
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91 0.46
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00 0.46
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46 1.00
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44 -0.18
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54 0.37
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47 -0.51
## black lstat medv
## crim -0.39 0.46 -0.39
## zn 0.18 -0.41 0.36
## indus -0.36 0.60 -0.48
## chas 0.05 -0.05 0.18
## nox -0.38 0.59 -0.43
## rm 0.13 -0.61 0.70
## age -0.27 0.60 -0.38
## dis 0.29 -0.50 0.25
## rad -0.44 0.49 -0.38
## tax -0.44 0.54 -0.47
## ptratio -0.18 0.37 -0.51
## black 1.00 -0.37 0.33
## lstat -0.37 1.00 -0.74
## medv 0.33 -0.74 1.00
# visualize the correlation matrix of the dataset
corrplot(cor_matrix, method="number", type='upper', diag = FALSE)
Several of the variables are highly skewed. In particular, crim, zn, chas, dis, and black are highly skewed. Some of the others appear to have moderate skewness. The skewed distributions suggest that transforming some variables could improve their performance in the models. We can observe several highly correlated variables in the correlation matrix. We have to be careful with highly correlated variables so that their influence in the models is not overstated. The next thing we need to do is standardize the dataset and print out summaries of the scaled data, then create a categorical variable of the crime rate in the Boston dataset using the quantiles as the break points, drop the old crime rate variable from the dataset, and create training and testing data (80% of the data belongs to the train set).
# center and standardize variables
boston_scaled <- scale(Boston)
# summaries of the scaled variables
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix"
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
# summary of the scaled crime rate
summary(boston_scaled$crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419367 -0.410563 -0.390280 0.000000 0.007389 9.924110
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
# create a categorical variable 'crime', using the quantiles as the break points
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
The test data has now been created. Next we fit a linear discriminant analysis (LDA) on the training set. Notice that in this case we have four classes. LDA starts by finding directions that maximize the separation between classes and then uses these directions to predict the class of each observation. These directions, called linear discriminants, are linear combinations of the predictor variables.
LDA assumes that the predictors are normally distributed (Gaussian) within each class and that the classes have class-specific means but a common covariance matrix.
LDA determines the group means and computes, for each observation, the probability of belonging to each group. The observation is then assigned to the group with the highest probability.
The lda() output contains the following elements:
Prior probabilities of groups: the proportion of training observations in each group.
Group means: the mean of each variable in each group.
Coefficients of linear discriminants: the linear combinations of predictor variables that are used to form the LDA decision rule.
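The assignment rule described above (each observation goes to the group with the highest probability) can be checked on a small example. This is only a sketch: it uses the built-in iris data as a stand-in, not the Boston data, and relies on MASS::lda() and its predict() method.

```r
library(MASS)  # provides lda()

# Built-in iris data, used here only as a stand-in example
fit <- lda(Species ~ ., data = iris)
pred <- predict(fit, newdata = iris)

# Each row of the posterior matrix holds the class probabilities for one flower
head(round(pred$posterior, 3))

# The predicted class is the column with the highest posterior probability
by_hand <- colnames(pred$posterior)[max.col(pred$posterior, ties.method = "first")]
all(by_hand == as.character(pred$class))
```

The same two elements, posterior and class, are available from predict() on any lda fit, including the one fitted on the training data here.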
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2400990 0.2623762 0.2351485 0.2623762
##
## Group means:
## zn indus chas nox rm age
## low 1.05523165 -0.8999229 -0.15056308 -0.8627673 0.4563859 -0.8953743
## med_low -0.09022217 -0.3493254 -0.01233188 -0.5625464 -0.1644558 -0.3292601
## med_high -0.37801632 0.1810955 0.14210254 0.3921530 0.1286878 0.4188575
## high -0.48724019 1.0149946 -0.08661679 1.0554659 -0.3383475 0.7876455
## dis rad tax ptratio black lstat
## low 0.8469751 -0.6823226 -0.7322566 -0.4518423 0.38244706 -0.78280709
## med_low 0.3797055 -0.5430701 -0.5188454 -0.0827358 0.31375658 -0.10882414
## med_high -0.3745698 -0.3858773 -0.2927688 -0.2925840 0.08463115 0.04074139
## high -0.8276226 1.6596029 1.5294129 0.8057784 -0.82082851 0.87284906
## medv
## low 0.53145122
## med_low -0.02582588
## med_high 0.18928544
## high -0.75421348
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.097249444 0.903944588 -0.95817593
## indus 0.016802108 -0.231320969 0.05292192
## chas -0.039884933 -0.127705170 0.13083156
## nox 0.277021160 -0.627650631 -1.29971686
## rm -0.095090865 0.004659735 -0.20831764
## age 0.328694756 -0.389021783 -0.10184085
## dis -0.076220352 -0.510436772 0.28673046
## rad 3.176294850 0.969655055 0.25179840
## tax 0.008119383 -0.197837677 0.33726198
## ptratio 0.117773609 0.110164705 -0.29037470
## black -0.164439417 0.023871831 0.09633785
## lstat 0.145913749 -0.317075357 0.32762511
## medv 0.136693503 -0.505580055 -0.28060922
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9457 0.0398 0.0144
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)
The crime rate was divided into four classes at its quantiles, and the crime variable is the target variable. In the plot we see four groups: three of them overlap and one lies far away from the others. The arrows show which variables affect the classification most (rad, zn, nox), but because there are so many variables it is hard to read off the others.
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 14 16 0 0
## med_low 3 12 5 0
## med_high 0 13 18 0
## high 0 0 1 20
#Calculate accuracy percent of the model
correct_predicts <- 100 * mean(lda.pred$class==correct_classes)
correct_predicts <- round(correct_predicts, digits = 0)
#Print correct predicts percentage
print(correct_predicts)
## [1] 63
We split our data earlier so that we have the test set and the correct class labels. The model's performance on the test data is acceptable but not perfect: the prediction accuracy is 63% (64 of 102 test observations). It predicts the high crime rate almost perfectly (20 of 21) but the lower rates clearly worse.
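The accuracy can also be read directly off the cross-tabulation. A minimal sketch, with the counts copied by hand from the table printed above (rows are the correct classes, columns the predicted ones):

```r
# Confusion matrix copied from the cross-tabulation above
lv <- c("low", "med_low", "med_high", "high")
conf <- matrix(c(14, 16,  0,  0,
                  3, 12,  5,  0,
                  0, 13, 18,  0,
                  0,  0,  1, 20),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct = lv, predicted = lv))

# Overall accuracy: correctly classified (diagonal) over all test observations
accuracy <- sum(diag(conf)) / sum(conf)
round(100 * accuracy)   # 63

# Share of each correct class that was predicted correctly
per_class <- diag(conf) / rowSums(conf)
round(per_class, 2)
```

The per-class shares make the pattern visible: the high class is recovered far better than the three lower classes, which are mostly confused with their neighbours.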
“Clustering is one of the most common exploratory data analysis technique used to get an intuition about the structure of the data. It can be defined as the task of identifying subgroups in the data such that data points in the same subgroup (cluster) are very similar while data points in different clusters are very different. In other words, we try to find homogeneous subgroups within the data such that data points in each cluster are as similar as possible according to a similarity measure such as euclidean-based distance or correlation-based distance. The decision of which similarity measure to use is application-specific.” (https://towardsdatascience.com/k-means-clustering-algorithm-applications-evaluation-methods-and-drawbacks-aa03e644b48a)
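The two distance measures used below differ in how they combine coordinate differences. A small sketch with two hypothetical points shows the difference; dist() uses the Euclidean distance by default.

```r
# Two hypothetical points in the plane
pts <- rbind(a = c(0, 0), b = c(3, 4))

d_euclid <- dist(pts)                        # sqrt(3^2 + 4^2)
d_manhat <- dist(pts, method = "manhattan")  # |3| + |4|

as.numeric(d_euclid)   # 5
as.numeric(d_manhat)   # 7
```

The Manhattan distance is never smaller than the Euclidean one for the same pair of points, which is consistent with the larger values in its summary.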
# load the data
data("Boston")
# Standardizing Boston dataset
scaled_boston <- scale(Boston)
# euclidean distance matrix
dist_eu <- dist(scaled_boston)
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
# manhattan distance matrix
dist_man <- dist(scaled_boston, method = 'manhattan')
# look at the summary of the distances
summary(dist_man)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2662 8.4832 12.6090 13.5488 17.7568 48.8618
# k-means clustering
km <-kmeans(scaled_boston, centers = 3)
# plot the scaled_boston dataset with clusters
pairs(scaled_boston, col = km$cluster)
set.seed(123)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(scaled_boston, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
# k-means clustering
km <-kmeans(scaled_boston, centers = 3)
# plot the scaled_boston dataset with clusters
pairs(scaled_boston, col = km$cluster)
I tested several different numbers of clusters. Based on the visualization, 3 appears to be the optimal number of clusters, as it is the bend in the elbow (the point around which the total WCSS drops radically and then levels off).
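The elbow idea is easier to see on data where the number of groups is known. The following is a sketch on simulated data (three hypothetical, well-separated groups), not on Boston. It also sets the seed before any call to kmeans() and uses nstart, an assumption on my part to make the result less dependent on the random starting centers.

```r
set.seed(123)  # k-means starts from random centers, so fix the seed first

# Hypothetical toy data: three well-separated groups in two dimensions
toy <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 6), ncol = 2),
  matrix(rnorm(100, mean = 12), ncol = 2)
)

# nstart = 20 reruns k-means from several random starts and keeps the best fit
k_max <- 6
twcss <- sapply(1:k_max, function(k) kmeans(toy, centers = k, nstart = 20)$tot.withinss)

# Drop in total WCSS gained by each additional cluster
round(-diff(twcss), 1)
```

The drops are large up to three clusters and small afterwards, which is the bend that the elbow plot is meant to reveal.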
# load the data
data("Boston")
# Standardizing Boston dataset
scaled_kmeans_boston <- scale(Boston)
scaled_kmeans_boston <- as.data.frame(scaled_kmeans_boston)
# k-means clustering
km <-kmeans(scaled_kmeans_boston, centers = 3)
lda_kmeans <- lda(km$cluster ~ ., data = scaled_kmeans_boston)
lda_kmeans
## Call:
## lda(km$cluster ~ ., data = scaled_kmeans_boston)
##
## Prior probabilities of groups:
## 1 2 3
## 0.2470356 0.3260870 0.4268775
##
## Group means:
## crim zn indus chas nox rm
## 1 -0.3989700 1.2614609 -0.9791535 -0.020354653 -0.8573235 1.0090468
## 2 0.7982270 -0.4872402 1.1186734 0.014005495 1.1351215 -0.4596725
## 3 -0.3788713 -0.3578148 -0.2879024 0.001080671 -0.3709704 -0.2328004
## age dis rad tax ptratio black
## 1 -0.96130713 0.9497716 -0.5867985 -0.6709807 -0.80239137 0.3552363
## 2 0.79930921 -0.8549214 1.2113527 1.2873657 0.59162230 -0.6363367
## 3 -0.05427143 0.1034286 -0.5857564 -0.5951053 0.01241316 0.2805140
## lstat medv
## 1 -0.9571271 1.06668290
## 2 0.8622388 -0.67953738
## 3 -0.1047617 -0.09820229
##
## Coefficients of linear discriminants:
## LD1 LD2
## crim -0.03206338 -0.19094456
## zn 0.02935900 -1.07677218
## indus 0.63347352 -0.09917524
## chas 0.02460719 0.10009606
## nox 1.11749317 -0.75995105
## rm -0.18841682 -0.57360135
## age -0.12983139 0.47226685
## dis 0.04493809 -0.34585958
## rad 0.67004295 -0.08584353
## tax 1.03992455 -0.58075025
## ptratio 0.25864960 -0.02605279
## black -0.01657236 0.01975686
## lstat 0.17365575 -0.41704235
## medv -0.06819126 -0.79098605
##
## Proportion of trace:
## LD1 LD2
## 0.8506 0.1494
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# cluster memberships as numeric (the LDA was fit on all 506 rows, so train$crime would not match)
cluster_classes <- as.numeric(km$cluster)
# plot the lda results
plot(lda_kmeans, dimen = 2, col = cluster_classes, pch = cluster_classes)
lda.arrows(lda_kmeans, myscale = 4)
In the plot we see two overlapping clusters and one cluster that lies away from the others. The arrows tell us that nox, zn, tax and medv are the most influential variables in the model.
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = classes)
# access the packages
library(MASS); library(corrplot); library(tidyr); library(dplyr); library(ggplot2); library(GGally); library(psych); library(DescTools)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
##
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
##
## nasa
##
## Attaching package: 'psych'
## The following object is masked from 'package:boot':
##
## logit
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
##
## Attaching package: 'DescTools'
## The following objects are masked from 'package:psych':
##
## AUC, ICC, SD
# read the data into memory
human <- read.csv("C:/Users/juusov/Documents/IODS-project/Data/human.csv", row.names = 1)
# Explore the structure and the dimensions of the data
str(human)
## 'data.frame': 155 obs. of 8 variables:
## $ Edu2.FM : num 1.007 0.997 0.983 0.989 0.969 ...
## $ Labo.FM : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Edu.Exp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ Life.Exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ GNI : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ Mat.Mor : int 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parli.F : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
colnames(human)
## [1] "Edu2.FM" "Labo.FM" "Edu.Exp" "Life.Exp" "GNI" "Mat.Mor"
## [7] "Ado.Birth" "Parli.F"
row.names(human)
## [1] "Norway"
## [2] "Australia"
## [3] "Switzerland"
## [4] "Denmark"
## [5] "Netherlands"
## [6] "Germany"
## [7] "Ireland"
## [8] "United States"
## [9] "Canada"
## [10] "New Zealand"
## [11] "Singapore"
## [12] "Sweden"
## [13] "United Kingdom"
## [14] "Iceland"
## [15] "Korea (Republic of)"
## [16] "Israel"
## [17] "Luxembourg"
## [18] "Japan"
## [19] "Belgium"
## [20] "France"
## [21] "Austria"
## [22] "Finland"
## [23] "Slovenia"
## [24] "Spain"
## [25] "Italy"
## [26] "Czech Republic"
## [27] "Greece"
## [28] "Estonia"
## [29] "Cyprus"
## [30] "Qatar"
## [31] "Slovakia"
## [32] "Poland"
## [33] "Lithuania"
## [34] "Malta"
## [35] "Saudi Arabia"
## [36] "Argentina"
## [37] "United Arab Emirates"
## [38] "Chile"
## [39] "Portugal"
## [40] "Hungary"
## [41] "Bahrain"
## [42] "Latvia"
## [43] "Croatia"
## [44] "Kuwait"
## [45] "Montenegro"
## [46] "Belarus"
## [47] "Russian Federation"
## [48] "Oman"
## [49] "Romania"
## [50] "Uruguay"
## [51] "Bahamas"
## [52] "Kazakhstan"
## [53] "Barbados"
## [54] "Bulgaria"
## [55] "Panama"
## [56] "Malaysia"
## [57] "Mauritius"
## [58] "Trinidad and Tobago"
## [59] "Serbia"
## [60] "Cuba"
## [61] "Lebanon"
## [62] "Costa Rica"
## [63] "Iran (Islamic Republic of)"
## [64] "Venezuela (Bolivarian Republic of)"
## [65] "Turkey"
## [66] "Sri Lanka"
## [67] "Mexico"
## [68] "Brazil"
## [69] "Georgia"
## [70] "Azerbaijan"
## [71] "Jordan"
## [72] "The former Yugoslav Republic of Macedonia"
## [73] "Ukraine"
## [74] "Algeria"
## [75] "Peru"
## [76] "Albania"
## [77] "Armenia"
## [78] "Bosnia and Herzegovina"
## [79] "Ecuador"
## [80] "China"
## [81] "Fiji"
## [82] "Mongolia"
## [83] "Thailand"
## [84] "Libya"
## [85] "Tunisia"
## [86] "Colombia"
## [87] "Jamaica"
## [88] "Tonga"
## [89] "Belize"
## [90] "Dominican Republic"
## [91] "Suriname"
## [92] "Maldives"
## [93] "Samoa"
## [94] "Botswana"
## [95] "Moldova (Republic of)"
## [96] "Egypt"
## [97] "Gabon"
## [98] "Indonesia"
## [99] "Paraguay"
## [100] "Philippines"
## [101] "El Salvador"
## [102] "South Africa"
## [103] "Viet Nam"
## [104] "Bolivia (Plurinational State of)"
## [105] "Kyrgyzstan"
## [106] "Iraq"
## [107] "Guyana"
## [108] "Nicaragua"
## [109] "Morocco"
## [110] "Namibia"
## [111] "Guatemala"
## [112] "Tajikistan"
## [113] "India"
## [114] "Honduras"
## [115] "Bhutan"
## [116] "Syrian Arab Republic"
## [117] "Congo"
## [118] "Zambia"
## [119] "Ghana"
## [120] "Bangladesh"
## [121] "Cambodia"
## [122] "Kenya"
## [123] "Nepal"
## [124] "Pakistan"
## [125] "Myanmar"
## [126] "Swaziland"
## [127] "Tanzania (United Republic of)"
## [128] "Cameroon"
## [129] "Zimbabwe"
## [130] "Mauritania"
## [131] "Papua New Guinea"
## [132] "Yemen"
## [133] "Lesotho"
## [134] "Togo"
## [135] "Haiti"
## [136] "Rwanda"
## [137] "Uganda"
## [138] "Benin"
## [139] "Sudan"
## [140] "Senegal"
## [141] "Afghanistan"
## [142] "Côte d'Ivoire"
## [143] "Malawi"
## [144] "Ethiopia"
## [145] "Gambia"
## [146] "Congo (Democratic Republic of the)"
## [147] "Liberia"
## [148] "Mali"
## [149] "Mozambique"
## [150] "Sierra Leone"
## [151] "Burkina Faso"
## [152] "Burundi"
## [153] "Chad"
## [154] "Central African Republic"
## [155] "Niger"
describe(human)
## vars n mean sd median trimmed mad min
## Edu2.FM 1 155 0.85 0.24 0.94 0.87 0.12 0.17
## Labo.FM 2 155 0.71 0.20 0.75 0.73 0.17 0.19
## Edu.Exp 3 155 13.18 2.84 13.50 13.24 2.97 5.40
## Life.Exp 4 155 71.65 8.33 74.20 72.40 7.56 49.00
## GNI 5 155 17627.90 18543.85 12040.00 14552.58 13337.47 581.00
## Mat.Mor 6 155 149.08 211.79 49.00 104.70 63.75 1.00
## Ado.Birth 7 155 47.16 41.11 33.60 41.62 35.73 0.60
## Parli.F 8 155 20.91 11.49 19.30 20.32 11.42 0.00
## max range skew kurtosis se
## Edu2.FM 1.50 1.33 -0.76 0.55 0.02
## Labo.FM 1.04 0.85 -0.87 0.05 0.02
## Edu.Exp 20.20 14.80 -0.20 -0.34 0.23
## Life.Exp 83.50 34.50 -0.76 -0.15 0.67
## GNI 123124.00 122543.00 2.14 6.83 1489.48
## Mat.Mor 1100.00 1099.00 2.03 4.16 17.01
## Ado.Birth 204.80 204.20 1.13 0.89 3.30
## Parli.F 57.50 57.50 0.55 -0.10 0.92
summary(human)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
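The ‘human’ dataset originates from the United Nations Development Programme. Human Development Indicators and Indices provide an overview of key aspects of human development. The data combines several indicators from most countries in the world. The combined source data has 19 different variables and 195 observations; the wrangled version analysed here keeps 8 variables for 155 countries, as the str() output above shows: Edu2.FM, Labo.FM, Edu.Exp, Life.Exp, GNI, Mat.Mor, Ado.Birth and Parli.F.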
# Draw distributions and correlations
ggpairs(human, lower = list(continuous = "smooth_loess")) + theme_classic()
# Draw correlation plot
cor(human)%>%corrplot(method="number", type='upper', diag = FALSE)
Most of the variables are skewed; only two of them (“Edu.Exp” and “Parli.F”) are close to normally distributed. The skewed distributions suggest that transforming some variables could improve how they perform in the models. Many of the variables are strongly correlated with each other, while some correlations are weak, especially those involving Parli.F.
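As an illustration of how a transformation can reduce skewness, here is a sketch on a simulated right-skewed variable (hypothetical values, not the human data). The skewness helper below is a simple moment-based definition and may differ slightly from the one reported by describe().

```r
# Sample skewness: third central moment divided by the cubed standard deviation
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

set.seed(1)
# Hypothetical right-skewed variable, loosely resembling an income-like measure
income <- rlnorm(500, meanlog = 9, sdlog = 1)

skewness(income)        # clearly positive
skewness(log(income))   # close to zero after the log transform
```

A log transform of this kind would be a natural candidate for strongly right-skewed variables such as GNI or Mat.Mor, although it is not applied in the analysis here.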
##PCA
# perform principal component analysis (with the SVD method)
pca_human_not_std <- prcomp(human)
sum_pca_human_not_std <- summary(pca_human_not_std)
pca_pr_not_std <- round(100*sum_pca_human_not_std$importance[2, ], digits = 3)
pca_pr_not_std
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 99.99 0.01 0.00 0.00 0.00 0.00 0.00 0.00
# standardize the variables
human_std <- scale(human)
# perform principal component analysis (with the SVD method)
pca_human_std <- prcomp(human_std)
sum_pca_human_std <- summary(pca_human_std)
pca_pr_std <- round(100*sum_pca_human_std$importance[2, ], digits = 3)
pca_pr_std
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 53.605 16.237 9.571 7.583 5.477 3.595 2.634 1.298
# perform principal component analysis (with the SVD method) without standardizing
pca_human_not <- prcomp(human)
# draw a biplot of the principal component representation and the original variables
biplot(pca_human_not, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
# perform principal component analysis (with the SVD method) with standardizing
pca_human <- prcomp(human_std)
# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
Above you can find the principal component analysis (PCA) of the non-standardized human data (the first biplot) and of the standardized data (the second one). PCA loads on the variables with large variances. Because it tries to capture the total variance in the set of variables, PCA requires that the input variables have similar scales of measurement. After scaling (standardizing), all variables are measured on the same scale and their variances are similar. That is why it makes sense to standardize the variables in this data.
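The effect of scale on PCA can be reproduced with a few lines of simulated data. This is a sketch under the assumption of three independent, hypothetical variables that differ only in their units; prcomp() with scale. = TRUE standardizes the columns before the decomposition.

```r
set.seed(42)
# Hypothetical data: three independent variables on very different scales
toy <- data.frame(
  big   = rnorm(200, sd = 1000),
  small = rnorm(200, sd = 1),
  tiny  = rnorm(200, sd = 0.01)
)

# Proportion of variance per component, without and with standardization
prop_raw <- summary(prcomp(toy))$importance[2, ]
prop_std <- summary(prcomp(toy, scale. = TRUE))$importance[2, ]

round(prop_raw, 4)   # PC1 takes essentially everything
round(prop_std, 4)   # variance is spread across the components
```

Without standardization the first component simply tracks the variable with the largest units, which mirrors the 99.99% share of PC1 in the non-standardized human data.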
Before standardizing, PC1 alone captured 99.99% of the variance and PC2 the remaining 0.01%, because GNI has by far the largest variance; after standardizing, the variance is spread over all eight principal components. A biplot visualizes the connections between two representations of the same data. First, a simple scatter plot is drawn where the observations are represented by two principal components (PCs). Then, arrows are drawn to visualize the connections between the original variables and the PCs. The following connections hold: 1.) The angle between the arrows can be interpreted as the correlation between the variables. 2.) The angle between a variable and a PC axis can be interpreted as the correlation between the two. 3.) The length of the arrows is proportional to the standard deviations of the variables.
The PCA results indicate that PC1 captures 53.6% of the variance in the data and PC2 16.2%, so the first two PCs explain about 70% of the total variance. PC1 is related to Edu.Exp, Mat.Mor, Life.Exp and Ado.Birth, and PC2 to Parli.F and Labo.FM. A small angle between two arrows indicates a positive correlation between the variables (the arrows are close to each other). In conclusion, we can detect two PCs: the first one related to basic living standards and quality of life, and the second one to gender equality.
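The "about 70%" figure is just the running total of the percentages printed for the standardized PCA. A short sketch, with the values copied by hand from that output:

```r
# Percentages of variance copied from the standardized PCA output above
pct <- c(PC1 = 53.605, PC2 = 16.237, PC3 = 9.571, PC4 = 7.583,
         PC5 = 5.477, PC6 = 3.595, PC7 = 2.634, PC8 = 1.298)

cumsum(pct)             # running total of explained variance
cumsum(pct)[["PC2"]]    # share captured by the first two components
```

The running total also shows how many components would be needed for a higher threshold, for example four components to pass 85%.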
# access the package
library(FactoMineR)
# load the data
data("tea")
# 'data' is the loader function itself, so inspect the loaded object 'tea' instead
colnames(tea)
str(tea)
dim(tea)
The tea dataset includes 36 variables and 300 observations. Most of the variables are categorical; only age is an integer. Let's pick some variables into a subset of the tea data. Our aim is to use that subset for Multiple Correspondence Analysis (MCA).
# column names to keep in the dataset
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
# select the 'keep_columns' to create a new dataset
tea_time <- dplyr::select(tea, one_of(keep_columns))
# look at the summaries and structure of the data
summary(tea_time)
## Tea How how sugar
## black : 74 alone:195 tea bag :170 No.sugar:155
## Earl Grey:193 lemon: 33 tea bag+unpackaged: 94 sugar :145
## green : 33 milk : 63 unpackaged : 36
## other: 9
## where lunch
## chain store :192 lunch : 44
## chain store+tea shop: 78 Not.lunch:256
## tea shop : 30
##
str(tea_time)
## 'data.frame': 300 obs. of 6 variables:
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
# visualize the dataset
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped
The subset includes six different categorical variables (“Tea”, “How”, “how”, “sugar”, “where”, “lunch”). The dataset contains the answers to a questionnaire on tea consumption. Let’s look at MCA, a method for analyzing qualitative data that extends correspondence analysis (CA). MCA can be used to detect patterns or structure in the data as well as for dimension reduction.
# multiple correspondence analysis
mca <- MCA(tea_time, graph = FALSE)
# summary of the model
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.279 0.261 0.219 0.189 0.177 0.156 0.144
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519 7.841
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953 77.794
## Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.141 0.117 0.087 0.062
## % of var. 7.705 6.392 4.724 3.385
## Cumulative % of var. 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139 0.003
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626 0.027
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111 0.107
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841 0.127
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979 0.035
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990 0.020
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347 0.102
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459 0.161
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968 0.478
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898 0.141
## v.test Dim.3 ctr cos2 v.test
## black 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 2.867 | 0.433 9.160 0.338 10.053 |
## green -5.669 | -0.108 0.098 0.001 -0.659 |
## alone -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 3.226 | 1.329 14.771 0.218 8.081 |
## milk 2.422 | 0.013 0.003 0.000 0.116 |
## other 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## sugar | 0.065 0.001 0.336 |
## where | 0.702 0.681 0.055 |
## lunch | 0.000 0.064 0.111 |
# visualize MCA
plot(mca, invisible=c("ind"), habillage = "quali")
MCA is a method for summarizing and visualizing a data table containing more than two categorical variables. It can also be seen as a generalization of principal component analysis when the variables to be analyzed are categorical instead of quantitative (Abdi and Williams 2010). MCA is generally used to analyse survey data. The goal is to identify: 1.) groups of individuals with similar profiles in their answers to the questions, and 2.) the associations between variable categories (http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/114-mca-multiple-correspondence-analysis-in-r-essentials/).
Let's look at the results. The first two dimensions capture about 30% of the total variance. In the plot, categories that lie close together are positively associated, and categories far apart are not. In practice this means, for example, that people who buy their tea in a tea shop more often use unpackaged green tea. Likewise, people who go to both chain stores and tea shops tend to use both tea bags and unpackaged tea, and people who only go to chain stores are more likely to use tea bags.
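The share of variance captured by the first two dimensions can be checked from the eigenvalue table. A minimal sketch, with the eigenvalues copied (rounded) from the summary above. The identity used for the total inertia, number of categories divided by number of variables minus one, is a standard property of MCA on an indicator matrix and is my addition, not something stated in the course material.

```r
# Eigenvalues copied from summary(mca); rounded to three decimals in the output
eig <- c(0.279, 0.261, 0.219, 0.189, 0.177, 0.156,
         0.144, 0.141, 0.117, 0.087, 0.062)

# Number of categories per variable in tea_time, read from str(tea_time)
n_levels <- c(Tea = 3, How = 4, how = 3, sugar = 2, where = 3, lunch = 2)
K <- length(n_levels)   # number of variables
J <- sum(n_levels)      # total number of categories

J - K                   # number of MCA dimensions, matches length(eig)
J / K - 1               # total inertia, close to sum(eig)

# Cumulative percentage of variance; small differences from the table are rounding
round(100 * cumsum(eig) / sum(eig), 1)
```

With 17 categories spread over 6 variables there are 11 dimensions, so no single dimension can be expected to carry a large share; about 30% on two dimensions is not unusually low for MCA.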
You can find the data wrangling exercise here. In the data wrangling exercise, we reshaped the data from the wide format into the long format. In the wide format, a subject’s repeated measures are in a single row, and each week is in a separate column. In the long format, each row is one time point per subject, so each subject has data in multiple rows. The main reason for setting up the data in one format or the other is simply that different analyses require different setups. Below you can see the difference between the wide (BPRS & RATS) and long (BPRSL & RATSL) formats after the data wrangling.
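The wide-to-long reshaping can be sketched on a tiny made-up table. This example uses tidyr::pivot_longer(), the newer counterpart of gather(), and extracts the week number with sub() instead of a fixed-position substr(), so that a hypothetical two-digit column such as week10 would also be parsed correctly. The data values and column names are invented for illustration.

```r
library(tidyr)

# Hypothetical wide data: one row per subject, one column per week
wide <- data.frame(
  subject = factor(1:2),
  week0   = c(42L, 58L),
  week1   = c(36L, 68L),
  week10  = c(30L, 50L)
)

# Long format: one row per subject and time point
long <- pivot_longer(wide, cols = -subject, names_to = "weeks", values_to = "bprs")

# Strip the "week" prefix and keep the full number, whatever its length
long$week <- as.integer(sub("^week", "", long$weeks))

long
```

Two subjects with three measurements each become six rows, which is the same multiplication that turns the 40 rows of BPRS into the 360 rows of BPRSL.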
# Access to libraries
library(tidyr); library(dplyr); library(ggplot2)
# Loading the data
BPRS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt", header = TRUE, sep = " ")
BPRS <- as.data.frame(BPRS)
RATS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt", header = TRUE, sep ="\t")
RATS <- as.data.frame(RATS)
# Look at the data in wide format
names(BPRS)
## [1] "treatment" "subject" "week0" "week1" "week2" "week3"
## [7] "week4" "week5" "week6" "week7" "week8"
str(BPRS)
## 'data.frame': 40 obs. of 11 variables:
## $ treatment: int 1 1 1 1 1 1 1 1 1 1 ...
## $ subject : int 1 2 3 4 5 6 7 8 9 10 ...
## $ week0 : int 42 58 54 55 72 48 71 30 41 57 ...
## $ week1 : int 36 68 55 77 75 43 61 36 43 51 ...
## $ week2 : int 36 61 41 49 72 41 47 38 39 51 ...
## $ week3 : int 43 55 38 54 65 38 30 38 35 55 ...
## $ week4 : int 41 43 43 56 50 36 27 31 28 53 ...
## $ week5 : int 40 34 28 50 39 29 40 26 22 43 ...
## $ week6 : int 38 28 29 47 32 33 30 26 20 43 ...
## $ week7 : int 47 28 25 42 38 27 31 25 23 39 ...
## $ week8 : int 51 28 24 46 32 25 31 24 21 32 ...
head(BPRS)
## treatment subject week0 week1 week2 week3 week4 week5 week6 week7 week8
## 1 1 1 42 36 36 43 41 40 38 47 51
## 2 1 2 58 68 61 55 43 34 28 28 28
## 3 1 3 54 55 41 38 43 28 29 25 24
## 4 1 4 55 77 49 54 56 50 47 42 46
## 5 1 5 72 75 72 65 50 39 32 38 32
## 6 1 6 48 43 41 38 36 29 33 27 25
str(RATS)
## 'data.frame': 16 obs. of 13 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Group: int 1 1 1 1 1 1 1 1 2 2 ...
## $ WD1 : int 240 225 245 260 255 260 275 245 410 405 ...
## $ WD8 : int 250 230 250 255 260 265 275 255 415 420 ...
## $ WD15 : int 255 230 250 255 255 270 260 260 425 430 ...
## $ WD22 : int 260 232 255 265 270 275 270 268 428 440 ...
## $ WD29 : int 262 240 262 265 270 275 273 270 438 448 ...
## $ WD36 : int 258 240 265 268 273 277 274 265 443 460 ...
## $ WD43 : int 266 243 267 270 274 278 276 265 442 458 ...
## $ WD44 : int 266 244 267 272 273 278 271 267 446 464 ...
## $ WD50 : int 265 238 264 274 276 284 282 273 456 475 ...
## $ WD57 : int 272 247 268 273 278 279 281 274 468 484 ...
## $ WD64 : int 278 245 269 275 280 281 284 278 478 496 ...
names(RATS)
## [1] "ID" "Group" "WD1" "WD8" "WD15" "WD22" "WD29" "WD36" "WD43"
## [10] "WD44" "WD50" "WD57" "WD64"
head(RATS)
## ID Group WD1 WD8 WD15 WD22 WD29 WD36 WD43 WD44 WD50 WD57 WD64
## 1 1 1 240 250 255 260 262 258 266 266 265 272 278
## 2 2 1 225 230 230 232 240 240 243 244 238 247 245
## 3 3 1 245 250 250 255 262 265 267 267 264 268 269
## 4 4 1 260 255 255 265 265 268 270 272 274 273 275
## 5 5 1 255 260 255 270 270 273 274 273 276 278 280
## 6 6 1 260 265 270 275 275 277 278 278 284 279 281
# BPRS includes 40 obs. of 11 variables in wide format
# RATS includes 16 obs. of 13 variables in wide format
# Categorical variables to factor
BPRS$treatment <- factor(BPRS$treatment)
BPRS$subject <- factor(BPRS$subject)
RATS$ID <- factor(RATS$ID)
RATS$Group <- factor(RATS$Group)
# Convert the data sets from wide format to long format and add a 'week' variable to BPRSL and a 'Time' variable to RATSL
library(dplyr)
library(tidyr)
BPRSL <- BPRS %>% gather(key = weeks, value = bprs, -treatment, -subject)
BPRSL <- BPRSL %>% mutate(week = as.integer(substr(weeks, 5, 5)))
RATSL <- RATS %>% gather(key = WD, value = Weight, -ID, -Group) %>% mutate(Time = as.integer(substr(WD, 3, 4)))
# Look at the data in long format
names(BPRSL)
## [1] "treatment" "subject" "weeks" "bprs" "week"
str(BPRSL)
## 'data.frame': 360 obs. of 5 variables:
## $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ subject : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ weeks : chr "week0" "week0" "week0" "week0" ...
## $ bprs : int 42 58 54 55 72 48 71 30 41 57 ...
## $ week : int 0 0 0 0 0 0 0 0 0 0 ...
head(BPRSL)
## treatment subject weeks bprs week
## 1 1 1 week0 42 0
## 2 1 2 week0 58 0
## 3 1 3 week0 54 0
## 4 1 4 week0 55 0
## 5 1 5 week0 72 0
## 6 1 6 week0 48 0
str(RATSL)
## 'data.frame': 176 obs. of 5 variables:
## $ ID : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
## $ WD : chr "WD1" "WD1" "WD1" "WD1" ...
## $ Weight: int 240 225 245 260 255 260 275 245 410 405 ...
## $ Time : int 1 1 1 1 1 1 1 1 1 1 ...
names(RATSL)
## [1] "ID" "Group" "WD" "Weight" "Time"
head(RATSL)
## ID Group WD Weight Time
## 1 1 1 WD1 240 1
## 2 2 1 WD1 225 1
## 3 3 1 WD1 245 1
## 4 4 1 WD1 260 1
## 5 5 1 WD1 255 1
## 6 6 1 WD1 260 1
# Now in LONG format BPRSL includes 360 obs. of 5 variables and RATSL in LONG format includes 176 obs. of 5 variables
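The wide-to-long step can also be sketched with `pivot_longer()`, which newer tidyr versions offer as the successor of `gather()`. This is only an illustration: the toy data frame `rats_wide` and its values are made up, and only the column naming (WD followed by the day number) follows the RATS data. Stripping the prefix with `sub()` instead of taking fixed character positions would keep working even if a day or week number had more digits than expected.

```r
library(tidyr)

# Hypothetical toy data in the same wide layout as RATS (values invented)
rats_wide <- data.frame(
  ID    = factor(1:2),
  Group = factor(c(1, 2)),
  WD1   = c(240, 410),
  WD8   = c(250, 415),
  WD15  = c(255, 425)
)

# One row per rat and measurement day
rats_long <- pivot_longer(rats_wide, cols = starts_with("WD"),
                          names_to = "WD", values_to = "Weight")

# Drop the "WD" prefix and keep the day number as an integer
rats_long$Time <- as.integer(sub("^WD", "", rats_long$WD))

# 2 rats x 3 measurement days = 6 rows
nrow(rats_long)
```

The same counting explains the sizes reported above: 16 rats x 11 measurement days give 176 rows, and 40 subjects x 9 weeks give 360 rows.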
# Saving the data
write.csv(BPRS, file = "C:/Users/juusov/Documents/IODS-project/Data/BPRS.csv")
write.csv(BPRSL, file = "C:/Users/juusov/Documents/IODS-project/Data/BPRSL.csv")
write.csv(RATS, file = "C:/Users/juusov/Documents/IODS-project/Data/RATS.csv")
write.csv(RATSL, file = "C:/Users/juusov/Documents/IODS-project/Data/RATSL.csv")
# Table 1
RATSL <- gather(RATS, key = WD, value = Weight, -ID, -Group) %>%
mutate(Time = as.integer(substr(WD,3,4)))
glimpse(RATSL)
## Observations: 176
## Variables: 5
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2,...
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, ...
## $ WD <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1...
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, ...
## $ Time <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, ...
head(RATSL); tail(RATSL)
## ID Group WD Weight Time
## 1 1 1 WD1 240 1
## 2 2 1 WD1 225 1
## 3 3 1 WD1 245 1
## 4 4 1 WD1 260 1
## 5 5 1 WD1 255 1
## 6 6 1 WD1 260 1
## ID Group WD Weight Time
## 171 11 2 WD64 472 64
## 172 12 2 WD64 628 64
## 173 13 3 WD64 525 64
## 174 14 3 WD64 559 64
## 175 15 3 WD64 548 64
## 176 16 3 WD64 569 64
# Figure 1.
ggplot(RATSL, aes(x = Time, y = Weight, group = ID)) +
geom_line(aes(linetype = Group)) + scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 10)) + scale_y_continuous(name = "Weight (grams)") + theme(legend.position = "top")
# Figure 2.
ggplot(RATSL, aes(x = Time, y = Weight, group = ID)) +
geom_line(aes(linetype = Group)) + facet_grid(. ~ Group) + scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 20)) + scale_y_continuous(name = "Weight (grams)") + theme(legend.position = "top")
As the figures above show, the repeated measures are certainly not independent of one another. The next table (Table 2) shows a linear regression model fitted to the RATS(L) data with ‘Weight’ as the response variable and ‘Group’ and ‘Time’ as explanatory variables. Note that this model ignores the repeated-measures structure of the data.
# Table 2
# create a regression model RATS_reg
RATS_reg <- lm(Weight ~ Time + Group, data = RATSL)
# print out a summary of the model
summary(RATS_reg)
##
## Call:
## lm(formula = Weight ~ Time + Group, data = RATSL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -60.643 -24.017 0.697 10.837 125.459
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 244.0689 5.7725 42.281 < 2e-16 ***
## Time 0.5857 0.1331 4.402 1.88e-05 ***
## Group2 220.9886 6.3402 34.855 < 2e-16 ***
## Group3 262.0795 6.3402 41.336 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34.34 on 172 degrees of freedom
## Multiple R-squared: 0.9283, Adjusted R-squared: 0.9271
## F-statistic: 742.6 on 3 and 172 DF, p-value: < 2.2e-16
# access library lme4
library(lme4)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
# Table 3
# Create a random intercept model
RATS_ref <- lmer(Weight ~ Time + Group + (1 | ID), data = RATSL, REML = FALSE)
# Print the summary of the model
summary(RATS_ref)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: Weight ~ Time + Group + (1 | ID)
## Data: RATSL
##
## AIC BIC logLik deviance df.resid
## 1333.2 1352.2 -660.6 1321.2 170
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.5386 -0.5581 -0.0494 0.5693 3.0990
##
## Random effects:
## Groups Name Variance Std.Dev.
## ID (Intercept) 1085.92 32.953
## Residual 66.44 8.151
## Number of obs: 176, groups: ID, 16
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 244.06890 11.73107 20.80
## Time 0.58568 0.03158 18.54
## Group2 220.98864 20.23577 10.92
## Group3 262.07955 20.23577 12.95
##
## Correlation of Fixed Effects:
## (Intr) Time Group2
## Time -0.090
## Group2 -0.575 0.000
## Group3 -0.575 0.000 0.333
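One way to read the variance components of the random intercept model is through the intraclass correlation, i.e. the share of the total variance that is due to differences between rats. The sketch below only re-uses the two variances printed in the summary above; the variable names are my own.

```r
# Variance components copied from the random intercept model summary above
var_id       <- 1085.92  # between-rat (random intercept) variance
var_residual <- 66.44    # within-rat residual variance

# Share of the total variance that is due to differences between rats
icc <- var_id / (var_id + var_residual)
round(icc, 3)
```

The result is about 0.94, which is in line with the figures: most of the variation in weight is between rats rather than within a rat over time.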
Now we can move on to fit the random intercept and random slope model to the rat growth data. Fitting a random intercept and random slope model allows the linear regression fits for each individual to differ in slope as well as in intercept. This way it is possible to account for the individual differences in the rats’ growth profiles and also for the effect of time. Below are the results from fitting the random intercept and random slope model, with ‘Time’ and ‘Group’ as explanatory variables.
# create a random intercept and random slope model
RATS_ref1 <- lmer(Weight ~ Time + Group + (Time | ID), data = RATSL, REML = FALSE)
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## Model failed to converge with max|grad| = 0.00952952 (tol = 0.002, component 1)
# print a summary of the model
summary(RATS_ref1)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: Weight ~ Time + Group + (Time | ID)
## Data: RATSL
##
## AIC BIC logLik deviance df.resid
## 1194.2 1219.6 -589.1 1178.2 168
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.2258 -0.4323 0.0554 0.5635 2.8821
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## ID (Intercept) 1138.783 33.7459
## Time 0.112 0.3346 -0.22
## Residual 19.750 4.4441
## Number of obs: 176, groups: ID, 16
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 246.51474 11.80983 20.874
## Time 0.58568 0.08541 6.857
## Group2 214.43334 20.17706 10.628
## Group3 258.85148 20.17706 12.829
##
## Correlation of Fixed Effects:
## (Intr) Time Group2
## Time -0.164
## Group2 -0.569 0.000
## Group3 -0.569 0.000 0.333
## convergence code: 0
## Model failed to converge with max|grad| = 0.00952952 (tol = 0.002, component 1)
# perform an ANOVA test on the two models
anova(RATS_ref1, RATS_ref)
## Data: RATSL
## Models:
## RATS_ref: Weight ~ Time + Group + (1 | ID)
## RATS_ref1: Weight ~ Time + Group + (Time | ID)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## RATS_ref 6 1333.2 1352.2 -660.58 1321.2
## RATS_ref1 8 1194.2 1219.6 -589.11 1178.2 142.94 2 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
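The comparison made by `anova()` is a likelihood ratio test, and its numbers can be reproduced by hand from the printed log-likelihoods. A minimal sketch, using only values copied from the output above (the variable names are my own):

```r
# Log-likelihoods and parameter counts copied from the anova() output above
loglik_intercept <- -660.58  # RATS_ref, 6 parameters
loglik_slope     <- -589.11  # RATS_ref1, 8 parameters

# Likelihood ratio statistic and its degrees of freedom
chisq <- 2 * (loglik_slope - loglik_intercept)
df    <- 8 - 6

# Upper-tail chi-squared probability
p_value <- pchisq(chisq, df = df, lower.tail = FALSE)
c(chisq = chisq, df = df, p_value = p_value)
```

The statistic comes out as 142.94 on 2 degrees of freedom, matching the table, so the random intercept and random slope model fits the data clearly better than the random intercept model.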
Finally, we can fit a random intercept and slope model that allows for a group × time interaction.
# create a random intercept and random slope model with a Time x Group interaction
RATS_ref2 <- lmer(Weight ~ Time * Group + (Time | ID), data = RATSL, REML = FALSE)
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## Model failed to converge with max|grad| = 0.00701626 (tol = 0.002, component 1)
# print a summary of the model
summary(RATS_ref2)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: Weight ~ Time * Group + (Time | ID)
## Data: RATSL
##
## AIC BIC logLik deviance df.resid
## 1185.9 1217.6 -582.9 1165.9 166
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.2666 -0.4249 0.0726 0.6034 2.7510
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## ID (Intercept) 1.105e+03 33.2488
## Time 4.924e-02 0.2219 -0.15
## Residual 1.975e+01 4.4440
## Number of obs: 176, groups: ID, 16
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 251.65165 11.79308 21.339
## Time 0.35964 0.08215 4.378
## Group2 200.66549 20.42622 9.824
## Group3 252.07168 20.42622 12.341
## Time:Group2 0.60584 0.14228 4.258
## Time:Group3 0.29834 0.14228 2.097
##
## Correlation of Fixed Effects:
## (Intr) Time Group2 Group3 Tm:Gr2
## Time -0.160
## Group2 -0.577 0.092
## Group3 -0.577 0.092 0.333
## Time:Group2 0.092 -0.577 -0.160 -0.053
## Time:Group3 0.092 -0.577 -0.053 -0.160 0.333
## convergence code: 0
## Model failed to converge with max|grad| = 0.00701626 (tol = 0.002, component 1)
# perform an ANOVA test on the two models
anova(RATS_ref2, RATS_ref1)
## Data: RATSL
## Models:
## RATS_ref1: Weight ~ Time + Group + (Time | ID)
## RATS_ref2: Weight ~ Time * Group + (Time | ID)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## RATS_ref1 8 1194.2 1219.6 -589.11 1178.2
## RATS_ref2 10 1185.9 1217.6 -582.93 1165.9 12.361 2 0.00207 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Figure 3
# draw the plot of RATSL
ggplot(RATSL, aes(x = Time, y = Weight, group = ID)) +
geom_line(aes(linetype = Group)) +
scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 20)) +
scale_y_continuous(name = "Observed weight (grams)") +
theme(legend.position = "top")
# Create a vector of the fitted values
Fitted <- fitted(RATS_ref2)
# Create a new column fitted to RATSL
RATSL <- RATSL %>%
mutate(Fitted)
# Figure 4
# draw the plot of RATSL
ggplot(RATSL, aes(x = Time, y = Fitted, group = ID)) +
geom_line(aes(linetype = Group)) +
scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 20)) +
scale_y_continuous(name = "Fitted weight (grams)") +
theme(legend.position = "top")
# Figures 5 & 6
Fitted <- fitted(RATS_ref2)
RATSL <- RATSL %>% mutate(Fitted)
p1 <- ggplot(RATSL, aes(x = Time, y = Weight, group = ID))
p2 <- p1 + geom_line(aes(linetype = Group))
p3 <- p2 + scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 20))
p4 <- p3 + scale_y_continuous(name = "Weight (grams)")
p5 <- p4 + theme_bw() + theme(legend.position = "right") # "none" in the book
p6 <- p5 + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p7 <- p6 + ggtitle("Observed")
graph1 <- p7
p1 <- ggplot(RATSL, aes(x = Time, y = Fitted, group = ID))
p2 <- p1 + geom_line(aes(linetype = Group))
p3 <- p2 + scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 20))
p4 <- p3 + scale_y_continuous(name = "Weight (grams)")
p5 <- p4 + theme_bw() + theme(legend.position = "right")
p6 <- p5 + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p7 <- p6 + ggtitle("Fitted")
graph2 <- p7
graph1; graph2
The figures above underline how well the interaction model fits the observed data. (The fitted values for each rat include “predicted” values of the u and v random effects for the rat; details of how these predicted values are calculated are given in Rabe-Hesketh and Skrondal, 2012.) In conclusion, all groups gained weight. The estimated regression parameters for the interaction indicate that the growth rate slopes are considerably higher for rats in group 2 than for rats in group 1, but less so when comparing group 3 rats with those in group 1.
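To make the interaction estimates concrete, the group-specific growth rates can be put together from the fixed effects of the interaction model. The sketch below only adds up estimates copied from the summary of `RATS_ref2`; the variable names are my own.

```r
# Fixed-effect estimates copied from the interaction model summary (RATS_ref2)
slope_group1 <- 0.35964  # Time: slope in the reference group (group 1)
extra_group2 <- 0.60584  # Time:Group2: additional slope in group 2
extra_group3 <- 0.29834  # Time:Group3: additional slope in group 3

# Estimated average weight gain in grams per day for each group
slopes <- c(group1 = slope_group1,
            group2 = slope_group1 + extra_group2,
            group3 = slope_group1 + extra_group3)
slopes
```

This gives roughly 0.36 grams per day in group 1, 0.97 in group 2 and 0.66 in group 3.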
The BPRS data include 40 male subjects who were randomly assigned to one of two treatment groups. Each subject was rated on the brief psychiatric rating scale (BPRS) before treatment began (week 0) and then at weekly intervals for eight weeks. The BPRS assesses the level of 18 symptom constructs such as hostility, suspiciousness, hallucinations and grandiosity; each of these is rated from one (not present) to seven (extremely severe). The scale is used to evaluate patients suspected of having schizophrenia. In long format the BPRS data include 360 observations and 5 variables.
# Look at the (column) names of BPRSL (long format)
names(BPRSL)
## [1] "treatment" "subject" "weeks" "bprs" "week"
# Look at the structure of BPRSL (long format)
str(BPRSL)
## 'data.frame': 360 obs. of 5 variables:
## $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ subject : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ weeks : chr "week0" "week0" "week0" "week0" ...
## $ bprs : int 42 58 54 55 72 48 71 30 41 57 ...
## $ week : int 0 0 0 0 0 0 0 0 0 0 ...
# Print out summaries of the variables
summary(BPRSL)
## treatment subject weeks bprs week
## 1:180 1 : 18 Length:360 Min. :18.00 Min. :0
## 2:180 2 : 18 Class :character 1st Qu.:27.00 1st Qu.:2
## 3 : 18 Mode :character Median :35.00 Median :4
## 4 : 18 Mean :37.66 Mean :4
## 5 : 18 3rd Qu.:43.00 3rd Qu.:6
## 6 : 18 Max. :95.00 Max. :8
## (Other):252
First of all, we draw plots of the BPRS values for all 40 men, differentiating between the treatment groups into which the men have been randomized (Figure 7).
# Figure 7
p1 <- ggplot(BPRSL, aes(x = week, y = bprs, linetype = subject))
p2 <- p1 + geom_line() + scale_linetype_manual(values = rep(1:10, times=4))
p3 <- p2 + facet_grid(. ~ treatment, labeller = label_both)
p4 <- p3 + theme_bw() + theme(legend.position = "none")
p5 <- p4 + theme(panel.grid.minor.y = element_blank())
p6 <- p5 + scale_y_continuous(limits = c(min(BPRSL$bprs), max(BPRSL$bprs)))
p6
# Standardise the scores:
BPRSL <- BPRSL %>%
group_by(week) %>%
mutate( stdbprs = (bprs - mean(bprs))/sd(bprs) ) %>%
ungroup()
glimpse(BPRSL)
## Observations: 360
## Variables: 6
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ subject <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17...
## $ weeks <chr> "week0", "week0", "week0", "week0", "week0", "week0", "we...
## $ bprs <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 6...
## $ week <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ stdbprs <dbl> -0.4245908, 0.7076513, 0.4245908, 0.4953559, 1.6983632, 0...
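What the standardisation does can be checked on a small example. The sketch below uses base R's `ave()` as a stand-in for the `group_by()` and `mutate()` step above; the toy data frame and its values are invented for illustration.

```r
# Hypothetical toy scores: 4 subjects measured at 2 weeks (values invented)
toy <- data.frame(
  week = rep(c(0, 1), each = 4),
  bprs = c(42, 58, 54, 55, 36, 68, 55, 77)
)

# Standardise within each week: subtract the week mean, divide by the week sd
toy$stdbprs <- ave(toy$bprs, toy$week,
                   FUN = function(x) (x - mean(x)) / sd(x))

# After standardising, every week has mean 0 and standard deviation 1
week_means <- tapply(toy$stdbprs, toy$week, mean)
week_sds   <- tapply(toy$stdbprs, toy$week, sd)
round(rbind(mean = week_means, sd = week_sds), 3)
```

Because each week is centred and scaled separately, the standardised values show where a subject stands relative to the others at that week, not the overall change in level over time.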
# Figure 8
p1 <- ggplot(BPRSL, aes(x = week, y = stdbprs, linetype = subject))
p2 <- p1 + geom_line() + scale_linetype_manual(values = rep(1:10, times=4))
p3 <- p2 + facet_grid(. ~ treatment, labeller = label_both)
p4 <- p3 + theme_bw() + theme(legend.position = "none")
p5 <- p4 + theme(panel.grid.minor.y = element_blank())
p6 <- p5 + scale_y_continuous(name = "standardized bprs")
p6
Figure 7 shows the non-standardized values and Figure 8 the standardized values. In Figure 8 the effect of the treatments should be easier to see because the values at each week are standardized to a common scale. Even after standardizing, however, it is still a little hard to make out a difference between the treatments. One alternative is to plot the mean profile of each treatment group, as in Figure 9; another is to graph side-by-side box plots of the observations at each time point, as in Figure 10. In Figure 10 we can clearly see the presence of some possible “outliers” at a number of time points.
# Figure 9
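# Number of weeks, baseline (week 0) included. Note: n is the number of time points (9), not the number of subjects, and it is used below as the divisor in se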
n <- BPRSL$week %>% unique() %>% length()
# Make a summary data:
BPRSS <- BPRSL %>%
group_by(treatment, week) %>%
summarise( mean=mean(bprs), se=sd(bprs)/sqrt(n) ) %>%
ungroup()
glimpse(BPRSS)
## Observations: 18
## Variables: 4
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2
## $ week <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 0, 1, 2, 3, 4, 5, 6, 7, 8
## $ mean <dbl> 47.00, 46.80, 43.55, 40.90, 36.60, 32.70, 29.70, 29.80, 2...
## $ se <dbl> 4.534468, 5.173708, 4.003617, 3.744626, 3.259534, 2.59576...
p1 <- ggplot(BPRSS, aes(x = week, y = mean, linetype = treatment, shape = treatment))
p2 <- p1 + geom_line() + scale_linetype_manual(values = c(1,2))
p3 <- p2 + geom_point(size=3) + scale_shape_manual(values = c(1,2))
p4 <- p3 + geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3)
p5 <- p4 + theme_bw() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p6 <- p5 + theme(legend.position = c(0.8,0.8))
p7 <- p6 + scale_y_continuous(name = "mean(bprs) +/- se(bprs)")
p7
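The error bars in Figure 9 use the number of time points (9) as the divisor. A standard error of the mean is more commonly computed with the number of observations behind each mean, which here would be the 20 subjects per treatment group and week. The sketch below shows how much that would change the first value in the summary above; it is my own back-calculation from the printed output, not something rerun on the data.

```r
# Standard error as reported for treatment 1 at week 0 (divisor: 9 time points)
se_reported <- 4.534468
n_weeks     <- 9

# Recover the underlying standard deviation of the 20 scores
sd_week0 <- se_reported * sqrt(n_weeks)

# Standard error of the mean with the number of subjects per group as divisor
n_subjects  <- 20
se_subjects <- sd_week0 / sqrt(n_subjects)
round(c(sd = sd_week0, se = se_subjects), 3)
```

With 20 subjects as the divisor the standard error would be about 3.04 instead of 4.53, so the error bars in Figure 9 are somewhat wider than they would be under that convention.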
# Figure 10
p1 <- ggplot(BPRSL, aes(x = factor(week), y = bprs, fill = treatment))
p2 <- p1 + geom_boxplot(position = position_dodge(width = 0.9))
p3 <- p2 + theme_bw() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p4 <- p3 + theme(legend.position = c(0.8,0.8))
p5 <- p4 + scale_x_discrete(name = "week")
# Black & White version:
#p6 <- p5 + scale_fill_grey(start = 0.5, end = 1)
p5
Let’s look at boxplots of the summary measure (mean bprs in weeks 1 to 8) for each treatment group. The resulting plot is shown in Figure 11. We see some outliers. For that reason the next figure (Figure 12) is drawn with the most extreme one (mean above 60) removed.
# Figure 11
# Make a summary data of the post treatment weeks (1-8)
BPRSL8S <- BPRSL %>%
filter(week > 0) %>%
group_by(treatment, subject) %>%
summarise( mean=mean(bprs) ) %>%
ungroup()
glimpse(BPRSL8S)
## Observations: 40
## Variables: 3
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ subject <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17...
## $ mean <dbl> 41.500, 43.125, 35.375, 52.625, 50.375, 34.000, 37.125, 3...
p1 <- ggplot(BPRSL8S, aes(x = treatment, y = mean))
p2 <- p1 + geom_boxplot()
p3 <- p2 + theme_bw() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p4 <- p3 + stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "white")
p5 <- p4 + scale_y_continuous(name = "mean(bprs), weeks 1-8")
p5
# Figure 12
# Remove the outlier:
BPRSL8S1 <- BPRSL8S %>%
filter(mean < 60)
glimpse(BPRSL8S1)
## Observations: 39
## Variables: 3
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ subject <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17...
## $ mean <dbl> 41.500, 43.125, 35.375, 52.625, 50.375, 34.000, 37.125, 3...
p1 <- ggplot(BPRSL8S1, aes(x = treatment, y = mean))
p2 <- p1 + geom_boxplot()
p3 <- p2 + theme_bw() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p4 <- p3 + stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "white")
p5 <- p4 + scale_y_continuous(name = "mean(bprs), weeks 1-8")
p5
Next we test whether there is any difference between the treatment groups. The results are shown in table 1. The t-test confirms the lack of any evidence for a group difference. The 95% confidence interval is also wide and includes zero, which leads to the same conclusion. The t-test was run on the data without the outlier.
# Without the outlier, apply Student's t-test, two-sided:
t.test(mean ~ treatment, data = BPRSL8S1, var.equal = TRUE)
##
## Two Sample t-test
##
## data: mean by treatment
## t = 0.52095, df = 37, p-value = 0.6055
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -4.232480 7.162085
## sample estimates:
## mean in group 1 mean in group 2
## 36.16875 34.70395
Baseline measurements of the outcome variable in a longitudinal study are often correlated with the chosen summary measure, and using them appropriately as a covariate in an analysis of covariance can often lead to substantial gains in precision (see Everitt and Pickles, 2004). We can illustrate the analysis using the BPRS value at time zero, taken prior to the start of treatment, as the baseline covariate. The results are shown in table 2. We see that the baseline BPRS is strongly related to the BPRS values taken after treatment has begun, but there is still no evidence of a treatment difference even after conditioning on the baseline value.
# Table 2
# Add the baseline from the original data as a new variable to the summary data
BPRSL8S2 <- BPRSL8S %>%
mutate(baseline = BPRS$week0)
# Fit the linear model with the mean as the response
fit <- lm(mean ~ baseline + treatment, data = BPRSL8S2)
# Compute the analysis of variance table for the fitted model with anova()
anova(fit)
## Analysis of Variance Table
##
## Response: mean
## Df Sum Sq Mean Sq F value Pr(>F)
## baseline 1 1868.07 1868.07 30.1437 3.077e-06 ***
## treatment 1 3.45 3.45 0.0557 0.8148
## Residuals 37 2292.97 61.97
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
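The entries of the analysis of variance table can be checked by hand, which also shows how large the baseline effect is. The sketch below uses only the numbers printed above; small differences from the table come from rounding, and the variable names are my own.

```r
# Sums of squares and residual mean square copied from the ANOVA table above
ss <- c(baseline = 1868.07, treatment = 3.45, residuals = 2292.97)
ms_residual <- 61.97
df_residual <- 37

# F statistic for treatment (1 degree of freedom, so mean square = sum of squares)
f_treatment <- ss[["treatment"]] / ms_residual
p_treatment <- pf(f_treatment, df1 = 1, df2 = df_residual, lower.tail = FALSE)

# Share of the total variation in the summary measure attributed to baseline
share_baseline <- ss[["baseline"]] / sum(ss)

round(c(F = f_treatment, p = p_treatment, share = share_baseline), 3)
```

Baseline accounts for roughly 45% of the variation in the mean post-treatment score, while the treatment term adds almost nothing once baseline is in the model.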
In conclusion, our results indicate that there is no difference between the treatments over the eight-week period, even when the baseline values are taken into account.